{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PCA Analysis in Python's using sklearn\n",
    "\n",
    "This notebook serves to discuss what is actually occuring behind the scenes in sklearn when the decomposition.pca package is being used.\n",
    "\n",
    "Note that PCA is commonly used to try and reduce the *feature* space.  There is a vast literature available on PCA analysis and we will not give a full account here.  The wikipedia page is a good first look: https://en.wikipedia.org/wiki/Principal_component_analysis\n",
    "\n",
    "In a simplified nutshell, we will transform the feature space into a new space, where the space is determined by directions of maximum variance.  This will allow us to drop directions of littler variance as they will contribute relatively little to the actual feature space.  \n",
    "\n",
    "##### Two excellent references:\n",
    "1. [Machine Learning: A Probabalistic Method](https://mitpress.mit.edu/books/machine-learning-0), *by Kevin P. Murphy* (he was a senior Research Scientist at Google in early days)\n",
    "2. [The Elements of Statistical Learning: Data Mining, Inference and Prediction](https://web.stanford.edu/~hastie/ElemStatLearn/), *by Hastie et. al.* (authors are from CS and Stats departments at Stanford)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.decomposition import PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-1, -1],\n",
       "       [-2, -1],\n",
       "       [-3, -2],\n",
       "       [ 1,  1],\n",
       "       [ 2,  1],\n",
       "       [ 3,  2]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# example data from sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html\n",
    "X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Create the PCA fit using sklearn function\n",
    "\n",
    "We will see what is going on behind the scenes below.  For the purposes of this example, we keep all components so that we can fully reconstruct the original parameter matrix, $X$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,\n",
       "  svd_solver='auto', tol=0.0, whiten=False)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca = PCA(n_components=2)\n",
    "pca.fit(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Transform the data into new space using sklearn built in function\n",
    "\n",
    "Formally, we are projected the original parameters onto the new space defined by directions of maximum variance (where the directions are orthogonal to eachother)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.38340578,  0.2935787 ],\n",
       "       [ 2.22189802, -0.25133484],\n",
       "       [ 3.6053038 ,  0.04224385],\n",
       "       [-1.38340578, -0.2935787 ],\n",
       "       [-2.22189802,  0.25133484],\n",
       "       [-3.6053038 , -0.04224385]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# representation of X in transformed space, ie, projection of X onto new basis\n",
    "Z = pca.transform(X)\n",
    "Z"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### The new space is represented by a basis, which happen to be the eigenvectors\n",
    "\n",
    "these are the eigenvectors (directions) for the transformed data in the reduced space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.83849224, -0.54491354],\n",
       "       [ 0.54491354, -0.83849224]])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.components_ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### In solving the PCA problem, the eigenvectors are constructed to be orthonormal\n",
    "\n",
    "that is, $ \\vec{e}_i \\cdot  \\vec{e}_j = 0$ when $j \\ne i$ and $ \\vec{e}_i \\cdot  \\vec{e}_j = 1$ when $j = i$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0\n",
      "1.0\n"
     ]
    }
   ],
   "source": [
    "print(np.dot(pca.components_[:,0],pca.components_[:,1]))\n",
    "print(np.dot(pca.components_[:,0],pca.components_[:,0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Singular values\n",
    "\n",
    "We will say more about these below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 6.30061232,  0.54980396])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.singular_values_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Transform from new space back to the original parameter space\n",
    "\n",
    "projection of new basis representation of $X$ back to original basis representation of $X$, which recovers original data (when all components used)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-1., -1.],\n",
       "       [-2., -1.],\n",
       "       [-3., -2.],\n",
       "       [ 1.,  1.],\n",
       "       [ 2.,  1.],\n",
       "       [ 3.,  2.]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pca.inverse_transform(Z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Above, we have shown the full deconstruction and reconstruction of X when using all components.\n",
    "\n",
    "We will now walk through two separate calculations using some linear algebra (which is what sklearn functions are actually doing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### We will compute the Covariance matrix $C$ and corresponding eigenvectors and eigenvalues"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "C = 1/(X.shape[0])*np.dot(X.T,X)\n",
    "w, v = np.linalg.eig(C) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### The components output from sklearn is simply the eigenvalues of the covariance matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.83849224, -0.54491354],\n",
       "       [ 0.54491354,  0.83849224]])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "v"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### The eigenvalues do not show up explicitly in the sklearn object, but are nothing more than the (scaled) square of the singular values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 6.30061232,  0.54980396])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sqrt(w*(X.shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### you can calculate your eigenvectors with the PCA outputs\n",
    "\n",
    "$ Z = XV$ where $V'$ = pca.components_ array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.38340578,  0.2935787 ],\n",
       "       [ 2.22189802, -0.25133484],\n",
       "       [ 3.6053038 ,  0.04224385],\n",
       "       [-1.38340578, -0.2935787 ],\n",
       "       [-2.22189802,  0.25133484],\n",
       "       [-3.6053038 , -0.04224385]])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Z1 = np.dot(X,pca.components_.T)\n",
    "Z1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# SVD can be recovered: X = U*sig*V'\n",
    "sig_inv = np.linalg.inv(np.eye(2)*pca.singular_values_)\n",
    "\n",
    "U = np.dot(Z1,sig_inv) \n",
    "U"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# check U orthonormal\n",
    "print(np.dot(U[:,0],U[:,1]))\n",
    "np.linalg.norm(np.dot(U[:,0],U[:,1])) < 10**-10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### map back to original space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "Xhat = np.dot(Z1,pca.components_)\n",
    "Xhat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3>Now use Singular Value Decomposition (SVD) to do same thing without using sklearn wrapper</h3>\n",
    "\n",
    "$ X = U\\Sigma V'$ is the common SVD representation, where $U$ and $V$ are unitary, and $\\Sigma$ is diagonal.  Then, we have,\n",
    "\n",
    "$Z := XV = U\\Sigma $ and clearly, to recover $X$, we have $X = ZV' = XVV'$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "u, s, vh = np.linalg.svd(X, full_matrices=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(6, 6)\n",
      "(2, 2)\n",
      "(2, 2)\n"
     ]
    }
   ],
   "source": [
    "print(u.shape)\n",
    "print(np.diag(s).shape)\n",
    "print(vh.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.21956688,  0.53396977, -0.48030985,  0.45219595,  0.02811389,\n",
       "         0.48030985],\n",
       "       [-0.35264795, -0.45713538, -0.30371038, -0.31508521,  0.61879559,\n",
       "         0.30371038],\n",
       "       [-0.57221483,  0.07683439,  0.75680405,  0.17257785,  0.0706181 ,\n",
       "         0.24319595],\n",
       "       [ 0.21956688, -0.53396977,  0.03329824,  0.79735166,  0.1693501 ,\n",
       "        -0.03329824],\n",
       "       [ 0.35264795,  0.45713538,  0.20989771,  0.03007049,  0.7600318 ,\n",
       "        -0.20989771],\n",
       "       [ 0.57221483, -0.07683439,  0.24319595, -0.17257785, -0.0706181 ,\n",
       "         0.75680405]])"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "u"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 6.30061232,  0.54980396])"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 6.30061232,  0.        ],\n",
       "       [ 0.        ,  0.54980396],\n",
       "       [ 0.        ,  0.        ],\n",
       "       [ 0.        ,  0.        ],\n",
       "       [ 0.        ,  0.        ],\n",
       "       [ 0.        ,  0.        ]])"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# full representation of singular values\n",
    "S = np.zeros((6, 2))\n",
    "S[:2, :2] = np.diag(s)\n",
    "S"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.83849224,  0.54491354],\n",
       "       [ 0.54491354, -0.83849224]])"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Create basis for transformed space, ie, create $Z$.  Note that this will equal sklearn up to order, to get perfect match, we would order these based on largest singular value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-1.38340578,  0.2935787 ],\n",
       "       [-2.22189802, -0.25133484],\n",
       "       [-3.6053038 ,  0.04224385],\n",
       "       [ 1.38340578, -0.2935787 ],\n",
       "       [ 2.22189802,  0.25133484],\n",
       "       [ 3.6053038 , -0.04224385]])"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Z2 = np.dot(X,vh.T)\n",
    "Z2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Map back to original paramter space, ie, recover X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-1., -1.],\n",
       "       [-2., -1.],\n",
       "       [-3., -2.],\n",
       "       [ 1.,  1.],\n",
       "       [ 2.,  1.],\n",
       "       [ 3.,  2.]])"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Xhat1 = np.dot(Z2,vh)\n",
    "Xhat1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concluding Remarks\n",
    "\n",
    "In using PCA Analysis, to reduce dimension, we simply start removing eigenvectors that correspond to 'small' eigenvalues, then proceed with the same calculation.  \n",
    "\n",
    "Or, in terms of [SVD](https://en.wikipedia.org/wiki/Singular-value_decomposition), remove singular values that are 'small' and their corresponding singular vectors.  \n",
    "\n",
    "In either case, you proceed with the calculations above with the reduced matrices / vectors.\n",
    "\n",
    "#### That's it, now you're an expert in the PCA done by sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
